Speech Recognition
Rivian is rolling out its AI-powered voice assistant
Rivian is rolling out its AI-powered in-vehicle voice assistant with the automaker's latest software update. It will be available to all Rivian Gen 1 and Gen 2 owners who pay for the company's Connect+ cellular subscription service, which costs $15 a month or $150 a year, or who are in the middle of an active trial. The assistant will also be available on Rivian's upcoming R2 mid-size electric SUV, which recently started production. Rivian is expected to make the first deliveries of the R2 EV's most expensive variant later this spring and to offer its $45,000 base model in 2027. The automaker first announced Rivian Assistant at its inaugural Autonomy and AI Day in December 2025, where it said the assistant will orchestrate different models and choose the best one for each task.
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.65)
- Information Technology > Communications > Mobile (0.55)
- Information Technology > Communications > Social Media (0.43)
Google now lets you have full conversations with Gemini for Home
The feature is rolling out across all of the smart home program's supported languages and regions. Google announced today that it is upgrading the Gemini for Home service with a continued-conversation feature. Continued conversation allows a user to have a natural discussion with the Gemini platform without prefacing every follow-up request with the "Hey Google" wake phrase. The microphone remains active on a smart device for a few seconds after the Gemini AI assistant provides its reply. During that window, the lights on the hardware pulse or glow, indicating that you can keep chatting with the chatbot without repeating the wake word.
- Information Technology > Communications > Mobile (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.52)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.51)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.50)
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
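To make the probing setup concrete, here is a minimal sketch of frame-level probing on a frozen CTC model. The model class, its `layers` attribute, and the phone inventory size are illustrative assumptions, not the paper's actual code.

```python
# Minimal probing sketch: a linear classifier trained on frozen
# frame-level features from one layer of a pre-trained CTC model.
import torch
import torch.nn as nn

NUM_PHONES = 48  # assumed phone inventory size (e.g., TIMIT-style)

class FrameProbe(nn.Module):
    """Linear classifier over frozen frame-level features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.linear = nn.Linear(feat_dim, NUM_PHONES)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        return self.linear(feats)       # logits: (batch, time, NUM_PHONES)

@torch.no_grad()
def extract_features(ctc_model, audio_feats, layer_idx):
    """Run the frozen CTC model and return one layer's activations.
    `ctc_model.layers` is an assumed attribute exposing the stack."""
    h = audio_feats
    for i, layer in enumerate(ctc_model.layers):
        h = layer(h)
        if i == layer_idx:
            return h
    return h

def probe_step(probe, optimizer, feats, phone_labels):
    """One training step of frame classification into phones.
    phone_labels: (batch, time) long tensor of phone indices."""
    logits = probe(feats)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, NUM_PHONES), phone_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                     # only the probe's weights update
    optimizer.step()
    return loss.item()
```

Training one such probe per layer and comparing frame-level phone accuracy is what lets representation quality be tracked as a function of depth, as the abstract describes.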
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.
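The key structural idea is the split into latent groups with different priors. Below is a minimal sketch of the corresponding KL terms, assuming diagonal Gaussians, a standard normal prior on the segment-level latent z1, and a unit-variance prior on z2 centered at a learned per-sequence mean mu2_seq; the names, shapes, and unit variances are illustrative, not the paper's exact formulation.

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1)

def fhvae_neg_elbo(recon_nll, mu1, logvar1, mu2, logvar2, mu2_seq):
    """Negative ELBO sketch for one segment.
    z1: sequence-independent prior N(0, I), intended to capture
        segment-varying (e.g., linguistic) content.
    z2: sequence-dependent prior N(mu2_seq, I), intended to capture
        factors shared across a sequence (e.g., speaker, channel)."""
    kl_z1 = kl_diag_gaussians(mu1, logvar1,
                              torch.zeros_like(mu1),
                              torch.zeros_like(logvar1))
    kl_z2 = kl_diag_gaussians(mu2, logvar2,
                              mu2_seq,
                              torch.zeros_like(logvar2))
    return recon_nll + kl_z1 + kl_z2
```

Because only z2's prior is tied to the sequence, factors shared across a whole sequence are pushed into z2 while segment-varying content lands in z1, which is what makes swapping one set of latents between utterances transform speaker or linguistic content as described above.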
Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices
Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones and embedded systems employing recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The acoustic model is trained end-to-end with a connectionist temporal classification (CTC) loss. The RNN implementation on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds the cache capacity and the parameters are used only once per time step. To remedy this problem, we employ a multi-time step parallelization approach that computes multiple output samples at a time with the parameters fetched from DRAM.
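To illustrate why reusing fetched parameters across time steps helps, here is a sketch with a plain RNN cell h_t = tanh(W_ih x_t + W_hh h_{t-1}). The input projection has no recurrent dependency, so it can be batched over a block of frames with a single weight fetch; this conveys the general idea of amortizing DRAM traffic over multiple time steps, not the paper's exact parallelization scheme.

```python
import numpy as np

def rnn_block(x_block, h0, W_ih, W_hh, b):
    """x_block: (T_block, in_dim); h0: (hid,); returns (T_block, hid).
    Hypothetical helper illustrating parameter reuse across a block."""
    # One large matmul: W_ih is fetched from memory once for the
    # whole block of T_block frames instead of once per frame.
    proj = x_block @ W_ih.T + b                # (T_block, hid)
    hs = np.empty((x_block.shape[0], h0.shape[0]))
    h = h0
    for t in range(x_block.shape[0]):          # sequential recurrent part
        h = np.tanh(proj[t] + W_hh @ h)
        hs[t] = h
    return hs
```

With a block of T frames, W_ih is read from DRAM once per block rather than once per step, shrinking the memory traffic that dominates single-step RNN inference on embedded hardware.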
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.
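A minimal sketch of the adversarial alignment step, in the spirit of unsupervised cross-lingual embedding mapping: a linear map W carries speech embeddings into the text space while a discriminator tries to tell mapped speech vectors from real text vectors. The embedding dimension, optimizer settings, and network sizes here are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

DIM = 300  # assumed shared embedding dimensionality

mapping = nn.Linear(DIM, DIM, bias=False)   # W: speech space -> text space
disc = nn.Sequential(                       # discriminator
    nn.Linear(DIM, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_map = torch.optim.SGD(mapping.parameters(), lr=0.1)
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def train_step(speech_batch, text_batch):
    """speech_batch, text_batch: (batch, DIM) embedding tensors."""
    # 1) Discriminator step: mapped speech -> label 0, real text -> label 1.
    with torch.no_grad():
        mapped = mapping(speech_batch)
    d_loss = (bce(disc(mapped), torch.zeros(len(mapped), 1))
              + bce(disc(text_batch), torch.ones(len(text_batch), 1)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Mapping step: update W so mapped speech fools the discriminator.
    m_loss = bce(disc(mapping(speech_batch)),
                 torch.ones(len(speech_batch), 1))
    opt_map.zero_grad()
    m_loss.backward()
    opt_map.step()
    return d_loss.item(), m_loss.item()
```

A refinement pass of the kind the abstract mentions would typically build a synthetic dictionary from mutual nearest neighbors between the two spaces and re-fit W from it in closed form; that step is omitted from this sketch.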
- North America > United States > New Jersey (0.04)
- Europe > Portugal > Braga > Braga (0.04)
- Africa > Mali (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Asia > Taiwan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.47)